2024-12-24 de novo prelim

Introduction

Three days ago I was looking through the GTDB-Tk manual and saw that de_novo_wf was an option for building the trees. From the description given:

knitr::include_url("https://ecogenomics.github.io/GTDBTk/commands/de_novo_wf.html")

I believed this would be worth doing as it might produce more accurate trees. Sample 1Dt2d (Enterobacter cancerogenus) had been placed in the genus Pantoea by classify_wf in the previous GTDB-Tk analysis, which led me to this search. After a bit of trial and error, I produced this script.

Methods

This ran as a Slurm job on Hawk (SCW) from roughly 20:10 on the 23rd to 01:00 on the 24th, totalling 4 hours and 50 minutes. The main parameters that I experimented with were:

- #SBATCH --ntasks=5
- #SBATCH --time=24:00:00
- #SBATCH --mem=50g
- --cpus 10

I settled on these as the “best”; however, it is entirely possible that they could be optimised further.

Results

This analysis produced these files:

/scratch/scw2160/02_outputs/flye_asm/gtdb_tk_de_novo5/
.:
text.txt
ls
touch
list.txt
align
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-table
gtdbtk.log
identify
infer
gtdbtk.warnings.log

./align:
gtdbtk.bac120.msa.fasta.gz
gtdbtk.bac120.user_msa.fasta.gz
gtdbtk.bac120.filtered.tsv

./identify:
gtdbtk.ar53.markers_summary.tsv
gtdbtk.bac120.markers_summary.tsv
gtdbtk.translation_table_summary.tsv
gtdbtk.failed_genomes.tsv

./infer:
gtdbtk.bac120.decorated.tree
gtdbtk.bac120.decorated.tree-taxonomy
gtdbtk.bac120.decorated.tree-table
intermediate_results

./infer/intermediate_results:
gtdbtk.bac120.rooted.tree
gtdbtk.bac120.fasttree.log
gtdbtk.bac120.tree.log
gtdbtk.bac120.unrooted.tree

I then opened the gtdbtk.bac120.decorated.tree file in Dendroscope for review. All 10 samples are on one tree, but 1Dt2d is still being placed in the “wrong” genus. I then reviewed its sister accession on the NCBI database.

Conclusion

The NCBI page for the sister accession includes a CheckM analysis that reports:

completeness: 90%
contamination: 3.6%
Taxonomy check status: failed

Upon viewing the tree in Dendroscope, the joining node has the label 0.968. I believe this to be the support value for that grouping (GTDB-Tk infers the tree with FastTree, which reports local support values between 0 and 1), so the sister relationship is well supported. I take this to mean they are the same species, and the online sample is also identified as Enterobacter cancerogenus. However, given the CheckM results above, I find it plausible that both have been misidentified and are in reality Pantoea species; I find this the most parsimonious explanation. I will follow this up with a CheckM analysis of my own on 1Dt2d.
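A support value like the 0.968 label can also be read straight from the Newick file, without opening Dendroscope. The following is only a minimal sketch: it assumes internal node labels appear either as a bare number or as a quoted 'support:taxon' string, which should be checked against the real gtdbtk.bac120.decorated.tree. The function name, the toy tree and the g__Example taxon are made up for illustration.

```python
import re

# Matches the label that directly follows a closing parenthesis, i.e. an
# internal node label. It is either quoted ('0.968:g__Example') or bare (0.75).
_INTERNAL_LABEL = re.compile(r"\)(?:'([^']*)'|([^\s:;,()']+))")


def internal_supports(newick):
    """Return (support, taxon_or_None) for each internal label that starts with a number."""
    results = []
    for quoted, bare in _INTERNAL_LABEL.findall(newick):
        label = quoted or bare
        support_text, _, taxon = label.partition(":")
        try:
            support = float(support_text)
        except ValueError:
            continue  # label carries no numeric support value, so skip it
        results.append((support, taxon or None))
    return results


# Toy tree for illustration only; this is not real GTDB-Tk output.
toy_tree = "((1Dt2d:0.01,sister:0.01)'0.968:g__Example':0.1,(a:0.2,b:0.2)0.75:0.3);"
supports = internal_supports(toy_tree)
print(supports)
```

To use it on the real output, the tree file would be read into a string first and passed to internal_supports in the same way.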

dendroscope screenshot showing location and relationship for 1Dt2d after de novo analysis

This was a “technical spike”, or proof of concept, for de_novo_wf.

📌 TODO: run a CheckM analysis of my own on 1Dt2d to see if the values are similar to the online sample

2024-12-25 🎄 beginning of table, CheckM

Introduction

I wanted to see if the outputs of CheckM differed from CheckM2, so I ran CheckM on Hawk. I also began recreating the initial metadata table for the Bangor samples; in the spirit of automation, I chose a less manual approach this time around.

Methods

Using this script, I ran a Slurm job on Hawk with CheckM's lineage_wf for all 10 Bangor-made samples; this took just 4 minutes. I also worked on exporting the data I want to tabulate off Hawk. Previously I did this by manually opening each file and noting down the important characteristics. However, because there are going to be more samples (and I wanted to be clever), I decided to use a more automated process. I identified files in the Flye directories on Hawk called “assembly_info.txt”, which contain the same information but are far easier to export. These are stored here: /scratch/scw2160/02_outputs/flye_asm/flye_asm_[accession]/. Using this script, I exported them off Hawk.
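The export step can be sketched roughly as below. This is not the actual script: it assumes each Flye output directory is named flye_asm_<accession> and holds one assembly_info.txt, and the function name and the destination folder are placeholders.

```python
import shutil
from pathlib import Path

PREFIX = "flye_asm_"


def collect_assembly_info(flye_root, dest_dir):
    """Copy each flye_asm_<accession>/assembly_info.txt to dest_dir/<accession>_assembly_info.txt."""
    flye_root, dest_dir = Path(flye_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for info in sorted(flye_root.glob(PREFIX + "*/assembly_info.txt")):
        # The accession is whatever follows "flye_asm_" in the directory name.
        accession = info.parent.name[len(PREFIX):]
        target = dest_dir / f"{accession}_assembly_info.txt"
        shutil.copy2(info, target)  # copy2 also keeps the file timestamps
        copied.append(target.name)
    return copied
```

Called as collect_assembly_info("/scratch/scw2160/02_outputs/flye_asm", "some_export_dir"), it skips any directory that does not match the flye_asm_ prefix, so other output folders in the same location are left alone.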

Results

The CheckM analysis produced this output directory. My export script copied all 10 “assembly_info.txt” files to a directory in my home directory and added the accession ID to each filename; this is important, as otherwise I wouldn't know which file belonged to which accession. I then downloaded them and stored them here.
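With the accession in each filename, building the metadata table is mostly a parsing job. The following is a rough sketch only: it assumes each file is tab-separated with a header line starting "#seq_name" and columns named "length" and "circ." (the layout Flye normally writes, but worth checking against a real file), with "Y" marking circular contigs. The function names and the summary columns are just examples.

```python
import csv
from pathlib import Path

SUFFIX = "_assembly_info.txt"


def summarise_assembly_info(path):
    """Summarise one <accession>_assembly_info.txt file as a dict."""
    path = Path(path)
    with path.open(newline="") as handle:
        # The header line starts with '#', e.g. "#seq_name<TAB>length<TAB>cov.<TAB>circ. ..."
        header = handle.readline().strip().lstrip("#").split("\t")
        rows = list(csv.DictReader(handle, fieldnames=header, delimiter="\t"))
    return {
        "accession": path.name[: -len(SUFFIX)],
        "contigs": len(rows),
        "total_length": sum(int(row["length"]) for row in rows),
        "circular_contigs": sum(row["circ."] == "Y" for row in rows),
    }


def build_table(export_dir):
    """Return one summary dict per exported file, sorted by filename."""
    return [summarise_assembly_info(p) for p in sorted(Path(export_dir).glob("*" + SUFFIX))]
```

The resulting list of dicts can be written out with csv.DictWriter or loaded into whatever table tool is preferred.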

Conclusions

With it being Christmas, I did not take a serious look at the significance of the outputs of either, so that is what I plan to do next, maybe reaching some conclusions by the end of tomorrow. This process is only half done and will continue into the following entry or entries.

📌 TODO: Complete analysis / processing of CheckM and table stuff